public class Lexer
extends java.lang.Object
Given a file stream fp it returns a sequence of tokens.

- GetToken(fp) gets the next token
- UngetToken(fp) provides one level undo

The tags include an attribute list:

- linked list of attribute/value nodes
- each node has 2 null-terminated strings
- entities are replaced in attribute values

White space is compacted if not in preformatted mode. If not in preformatted mode then leading white space is discarded and subsequent white space sequences compacted to single space chars.

If XmlTags is no then Tag names are folded to upper case and attribute names to lower case.

Not yet done:

- Doctype subset and marked sections
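The white space rule in the class description can be sketched in a few lines. This is a minimal illustration, not JTidy's implementation: the class `WhitespaceSketch` and its `compact` method are hypothetical names, and the sketch works on a `String` rather than on the lexer's byte buffer.

```java
// Hypothetical helper, not part of Lexer: illustrates the white space rule only.
final class WhitespaceSketch {

    /** Drops leading white space and collapses later runs to one space, unless preformatted. */
    static String compact(String text, boolean preformatted) {
        if (preformatted) {
            return text; // preformatted mode keeps white space as is
        }
        StringBuilder out = new StringBuilder();
        boolean wasWhite = true; // starting "in white space" discards leading white space
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean white = c == ' ' || c == '\t' || c == '\n' || c == '\r';
            if (!white) {
                out.append(c);
                wasWhite = false;
            } else if (!wasWhite) {
                out.append(' '); // first white space char of a run becomes one space
                wasWhite = true;
            }
            // further white space chars in the same run are skipped
        }
        return out.toString();
    }
}
```

In this sketch a trailing run of white space is also reduced to a single space, so `compact("  a \n\t b  ", false)` yields `"a b "`. The `wasWhite` local plays a role similar to the one the `waswhite` field is documented for.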
| Modifier and Type | Field | Description |
|---|---|---|
| protected short | badAccess | for accessibility errors. |
| protected short | badChars | for bad char encodings. |
| protected boolean | badDoctype | set if html or PUBLIC is missing. |
| protected short | badForm | for mismatched/mispositioned form tags. |
| protected short | badLayout | for bad style errors. |
| protected int | columns | at start of current token. |
| protected Configuration | configuration | configuration. |
| protected int | doctype | version as given by doctype (if any). |
| protected short | errors | count of errors. |
| protected java.io.PrintWriter | errout | error output stream. |
| protected boolean | excludeBlocks | Netscape compatibility. |
| protected boolean | exiled | true if moved out of table. |
| static short | IGNORE_MARKUP | state: ignore markup. |
| static short | IGNORE_WHITESPACE | state: ignore whitespace. |
| protected StreamIn | in | file stream. |
| protected Node | inode | Inline stack for compatibility with Mosaic. |
| protected int | insert | for inferring inline tags. |
| protected boolean | insertspace | when space is moved after end tag. |
| protected java.util.Stack | istack | stack. |
| protected int | istackbase | start of frame. |
| protected boolean | isvoyager | true if xmlns attribute on html element. |
| protected byte[] | lexbuf | Lexer character buffer. Parse tree nodes span onto this buffer, which contains the concatenated text contents of all of the elements. |
| protected int | lexlength | allocated. |
| protected int | lexsize | used. |
| protected int | lines | lines seen. |
| static short | MIXED_CONTENT | state: mixed content. |
| static short | PREFORMATTED | state: preformatted. |
| protected boolean | pushed | true after token has been pushed back. |
| protected Report | report | report. |
| protected Node | root | Root node is saved here. |
| protected boolean | seenEndBody | already seen end body tag? |
| protected boolean | seenEndHtml | already seen end html tag? |
| protected short | state | state of lexer's finite state machine. |
| protected Style | styles | used for cleaning up presentation markup. |
| protected Node | token | current node. |
| protected int | txtend | end of current node. |
| protected int | txtstart | start of current node. |
| protected short | versions | bit vector of HTML versions. |
| protected short | warnings | count of warnings in this document. |
| protected boolean | waswhite | used to collapse contiguous white space. |

| Constructor | Description |
|---|---|
| Lexer(StreamIn in, Configuration configuration, Report report) | Instantiates a new Lexer. |

| Modifier and Type | Method | Description |
|---|---|---|
| void | addByte(int c) | Adds a byte to lexer buffer. |
| void | addCharToLexer(int c) | Store char c as UTF-8 encoded byte stream. |
| boolean | addGenerator(Node root) | Add meta element for Tidy. |
| void | addStringLiteral(java.lang.String str) | Calls addCharToLexer for each char in the string. |
| void | addStringToLexer(java.lang.String str) | Adds a string to lexer buffer. |
| short | apparentVersion() | Return the html version used in document. |
| boolean | canPrune(Node element) | Can the given element be removed? |
| void | changeChar(byte c) | Substitute the last char in buffer. |
| boolean | checkDocTypeKeyWords(Node doctype) | Check system keywords (keywords should be uppercase). |
| AttVal | cloneAttributes(AttVal attrs) | Clones an attribute value and adds any ASP or PHP node to the node list. |
| Node | cloneNode(Node node) | Clones a node and adds it to the node list. |
| void | deferDup() | Defer duplicates when entering a table or other element where the inlines shouldn't be duplicated. |
| boolean | endOfInput() | Has end of input stream been reached? |
| short | findGivenVersion(Node doctype) | Examine DOCTYPE to identify version. |
| boolean | fixDocType(Node root) | Fixup doctype if missing. |
| void | fixHTMLNameSpace(Node root, java.lang.String profile) | Fix xhtml namespace. |
| void | fixId(Node node) | Duplicates the name attribute as an id and checks if id and name match. |
| boolean | fixXmlDecl(Node root) | Ensure XML document starts with `<?XML version="1.0"?>`. |
| Node | getCDATA(Node container) | Create a text node for the contents of a CDATA element like style or script which ends with `</foo>` for some foo. |
| Node | getToken(short mode) | Gets a token. |
| short | htmlVersion() | Choose what version to use for new doctype. |
| java.lang.String | htmlVersionName() | Choose what version to use for new doctype. |
| Node | inferredTag(java.lang.String name) | Generates and inserts a new node. |
| int | inlineDup(Node node) | This has the effect of inserting "missing" inline elements around the contents of blocklevel elements such as P, TD, TH, DIV, PRE etc. |
| Node | insertedToken() | |
| static boolean | isCSS1Selector(java.lang.String buf) | In CSS1, selectors can contain only the characters A-Z, 0-9, and Unicode characters 161-255, plus dash (-); they cannot start with a dash or a digit; they can also contain escaped characters and any Unicode character as a numeric code (see next item). |
| boolean | isPushed(Node node) | Is the node in the stack? |
| static boolean | isValidAttrName(java.lang.String attr) | Check if attr is a valid name. |
| Node | newLineNode() | Adds a new line node. |
| Node | newNode() | Creates a new node and adds it to the node list. |
| Node | newNode(short type, byte[] textarray, int start, int end) | Creates a new node and adds it to the node list. |
| Node | newNode(short type, byte[] textarray, int start, int end, java.lang.String element) | Creates a new node and adds it to the node list. |
| Node | parseAsp() | Parser for ASP within start tags. Some people use ASP to customize attributes. Tidy isn't really well suited to dealing with ASP. This is a workaround for attributes, but won't deal with the case where the ASP is used to tailor the attribute value. |
| java.lang.String | parseAttribute(boolean[] isempty, Node[] asp, Node[] php) | Consumes the '>' terminating start tags. |
| AttVal | parseAttrs(boolean[] isempty) | Parse tag attributes. |
| void | parseEntity(short mode) | Parse an html entity. |
| Node | parsePhp() | PHP is like ASP but is based upon XML processing instructions, e.g. `<?php ... ?>`. |
| int | parseServerInstruction() | Invoked when `<` is seen in place of an attribute value, but terminates on whitespace if not ASP, PHP or Tango. This routine recognizes ' and " quoted strings. |
| char | parseTagName() | Parses a tag name. |
| java.lang.String | parseValue(java.lang.String name, boolean foldCase, boolean[] isempty, int[] pdelim) | Parse an attribute value. |
| void | popInline(Node node) | Pop a copy of an inline node from the stack. |
| protected boolean | preContent(Node node) | Is content acceptable for pre elements? |
| void | pushInline(Node node) | Push a copy of an inline node onto stack, but don't push if implicit or OBJECT or APPLET (implicit tags are ones generated from the istack). One issue arises with pushing inlines when the tag is already pushed. |
| boolean | setXHTMLDocType(Node root) | Adds a new xhtml doctype to the document. |
| void | ungetToken() | |
| protected void | updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray) | Update oldtextarray in the current nodes. |

public static final short IGNORE_WHITESPACE
public static final short MIXED_CONTENT
public static final short PREFORMATTED
public static final short IGNORE_MARKUP
protected StreamIn in
protected java.io.PrintWriter errout
protected short badAccess
protected short badLayout
protected short badChars
protected short badForm
protected short warnings
protected short errors
protected int lines
protected int columns
protected boolean waswhite
protected boolean pushed
protected boolean insertspace
protected boolean excludeBlocks
protected boolean exiled
protected boolean isvoyager
protected short versions
protected int doctype
protected boolean badDoctype
protected int txtstart
protected int txtend
protected short state
protected Node token
protected byte[] lexbuf
protected int lexlength
protected int lexsize
protected Node inode
protected int insert
protected java.util.Stack istack
protected int istackbase
protected Style styles
protected Configuration configuration
protected boolean seenEndBody
protected boolean seenEndHtml
protected Report report
protected Node root
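
The lexbuf, txtstart and txtend fields suggest that nodes record offsets into one shared byte buffer rather than holding their own text. A minimal sketch of that idea follows. `SpanSketch` is a hypothetical class, not a JTidy type; treating the end offset as exclusive and decoding the bytes as UTF-8 are assumptions made for the illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a parse tree node: it stores offsets, not its own text.
final class SpanSketch {
    private final byte[] buffer; // shared buffer, playing the role of lexbuf
    private final int start;     // inclusive offset, like txtstart
    private final int end;       // exclusive offset (assumption), like txtend

    SpanSketch(byte[] buffer, int start, int end) {
        if (start < 0 || end < start || end > buffer.length) {
            throw new IllegalArgumentException("span outside buffer");
        }
        this.buffer = buffer;
        this.start = start;
        this.end = end;
    }

    /** Decodes only this node's slice of the shared buffer. */
    String text() {
        return new String(buffer, start, end - start, StandardCharsets.UTF_8);
    }
}
```

With a buffer holding the bytes of `Hello world`, a span of 0 to 5 reads back `Hello` and a span of 6 to 11 reads back `world`, and neither copy of the text is stored twice.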
public Lexer(StreamIn in, Configuration configuration, Report report)
in - StreamIn
configuration - configuration instance
report - report instance, for reporting errors

public Node newNode()

public Node newNode(short type, byte[] textarray, int start, int end)
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position

public Node newNode(short type, byte[] textarray, int start, int end, java.lang.String element)
type - node type: Node.ROOT_NODE | Node.DOCTYPE_TAG | Node.COMMENT_TAG | Node.PROC_INS_TAG | Node.TEXT_NODE | Node.START_TAG | Node.END_TAG | Node.START_END_TAG | Node.CDATA_TAG | Node.SECTION_TAG | Node.ASP_TAG | Node.JSTE_TAG | Node.PHP_TAG | Node.XML_DECL
textarray - array of bytes contained in the Node
start - start position
end - end position
element - tag name

public Node cloneNode(Node node)
node - Node

public AttVal cloneAttributes(AttVal attrs)
attrs - original AttVal

protected void updateNodeTextArrays(byte[] oldtextarray, byte[] newtextarray)
Update oldtextarray in the current nodes.
oldtextarray - previous text array
newtextarray - new text array

public Node newLineNode()

public boolean endOfInput()
Returns true if the end of the input stream has been reached.

public void addByte(int c)
c - byte to add

public void changeChar(byte c)
c - new char

public void addCharToLexer(int c)
c - char to store

public void addStringToLexer(java.lang.String str)
str - String to add

public void parseEntity(short mode)
mode - mode

public char parseTagName()

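
addCharToLexer is described as storing char c as a UTF-8 encoded byte stream. The sketch below shows the standard UTF-8 bit layout for a single code point. It is an illustration under assumptions, not JTidy's code: `Utf8Sketch` and `encode` are hypothetical names, the bytes are returned instead of being appended to the lexer buffer, and surrogate code points are not treated specially.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
final class Utf8Sketch {

    static byte[] encode(int c) {
        if (c < 0 || c > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + c);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream(4);
        if (c < 0x80) {
            out.write(c);                          // 0xxxxxxx
        } else if (c < 0x800) {
            out.write(0xC0 | (c >> 6));            // 110xxxxx
            out.write(0x80 | (c & 0x3F));          // 10xxxxxx
        } else if (c < 0x10000) {
            out.write(0xE0 | (c >> 12));           // 1110xxxx
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        } else {
            out.write(0xF0 | (c >> 18));           // 11110xxx
            out.write(0x80 | ((c >> 12) & 0x3F));
            out.write(0x80 | ((c >> 6) & 0x3F));
            out.write(0x80 | (c & 0x3F));
        }
        return out.toByteArray();
    }
}
```

For example, `encode(0x20AC)` (the euro sign) produces the three bytes E2 82 AC, while `encode('A')` produces the single byte 41.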
public void addStringLiteral(java.lang.String str)
str - input String

public short htmlVersion()

public java.lang.String htmlVersionName()

public boolean addGenerator(Node root)
root - root node
Returns true if the tag has been added.

public boolean checkDocTypeKeyWords(Node doctype)
doctype - doctype node

public short findGivenVersion(Node doctype)
doctype - doctype node

public void fixHTMLNameSpace(Node root, java.lang.String profile)
root - root Node
profile - current profile

public boolean setXHTMLDocType(Node root)
root - root node
Returns true if a doctype has been added.

public short apparentVersion()

public boolean fixDocType(Node root)
root - root node
Returns false if the current version has not been identified.

public boolean fixXmlDecl(Node root)
Ensure XML document starts with `<?XML version="1.0"?>`. Add encoding attribute if not using ASCII or UTF-8 output.
root - root node

public Node inferredTag(java.lang.String name)
name - tag name

public Node getCDATA(Node container)
container - container node

public void ungetToken()

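
The class description says UngetToken provides one level of undo, and the pushed field is documented as "true after token has been pushed back". A minimal sketch of that pattern follows. `PushbackSketch` is a hypothetical class that hands out strings from a list; it is not the JTidy Lexer and does not reproduce its token type or its mode argument.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical token source: shows one level of undo only.
final class PushbackSketch {
    private final Iterator<String> source;
    private String token;   // most recently returned token
    private boolean pushed; // true after the token has been pushed back

    PushbackSketch(List<String> tokens) {
        this.source = tokens.iterator();
    }

    /** Returns the pushed-back token if there is one, otherwise the next token or null at end. */
    String getToken() {
        if (pushed) {
            pushed = false;
            return token;
        }
        token = source.hasNext() ? source.next() : null;
        return token;
    }

    /** Marks the last token as pushed back; calling it twice still undoes only one token. */
    void ungetToken() {
        pushed = true;
    }
}
```

Because only a flag and the last token are kept, a second ungetToken call before the next getToken has no additional effect, which is what "one level" means in this sketch.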
public Node getToken(short mode)
mode - one of the following:
- MixedContent -- for elements which don't accept PCDATA
- Preformatted -- white space preserved as is
- IgnoreMarkup -- for CDATA elements such as script, style

public Node parseAsp()
Example: `href='<%=rsSchool.Fields("ID").Value%>'`, where the ASP that generates the attribute value is masked from Tidy by the quotemarks.

public Node parsePhp()
PHP is like ASP but is based upon XML processing instructions, e.g. `<?php ... ?>`.

public java.lang.String parseAttribute(boolean[] isempty, Node[] asp, Node[] php)
isempty - flag is passed as array so it can be modified
asp - asp Node, passed as array so it can be modified
php - php Node, passed as array so it can be modified

public int parseServerInstruction()

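
Several parameters above, such as isempty in parseAttribute, are passed as one-element arrays "so it can be modified". Java passes primitives by value, so a one-element array is a common way to let a method report a second result. The sketch below illustrates the pattern only: `OutParamSketch` and `readName` are hypothetical, and the trailing-slash rule is an assumption chosen to keep the example small.

```java
// Hypothetical example of the array-as-out-parameter pattern.
final class OutParamSketch {

    /**
     * Returns the attribute name found in tagBody and sets isempty[0]
     * to true when the text ends with '/', false otherwise.
     */
    static String readName(String tagBody, boolean[] isempty) {
        String trimmed = tagBody.trim();
        isempty[0] = trimmed.endsWith("/");
        if (isempty[0]) {
            // drop the slash, then any white space that preceded it
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        return trimmed;
    }
}
```

A caller allocates `boolean[] isempty = new boolean[1]`, passes it in, and reads `isempty[0]` after the call alongside the returned string.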
public java.lang.String parseValue(java.lang.String name, boolean foldCase, boolean[] isempty, int[] pdelim)
name - attribute name
foldCase - fold case?
isempty - is attribute empty? Passed as an array reference to allow modification
pdelim - delimiter, passed as an array reference to allow modification

public static boolean isValidAttrName(java.lang.String attr)
attr - String to check, must be non-null
Returns true if attr is a valid name.

public static boolean isCSS1Selector(java.lang.String buf)
buf - css selector name
Returns true if the given string is a valid css1 selector name.

public AttVal parseAttrs(boolean[] isempty)
isempty - is tag empty?

public void pushInline(Node node)
For example, `<p><em> text <p><em> more text` shouldn't be mapped to `<p><em> text </em></p><p><em><em> more text </em></em>`.
node - Node to be pushed

public void popInline(Node node)
node - Node to be popped

public boolean isPushed(Node node)
node - Node
Returns true if the node is found in the stack.

public int inlineDup(Node node)
Example: `<i><h1>italic heading</h1></i>`, which is then treated as equivalent to `<h1><i>italic heading</i></h1>`. This is implemented by setting the lexer into a mode where it gets tokens from the inline stack rather than from the input stream.
node - original node

public Node insertedToken()

public boolean canPrune(Node element)
element - node
Returns true if the element can be removed.

public void fixId(Node node)
node - Node to check for name/id attributes

public void deferDup()

protected boolean preContent(Node node)
node - content
Returns true if node is acceptable in pre elements.
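
The character rule quoted for isCSS1Selector (letters, digits, Unicode characters 161-255 and dash, with no leading dash or digit) can be sketched as a simple scan. This is a partial illustration, not JTidy's implementation: `SelectorSketch` and `looksLikeCss1Selector` are hypothetical names, lower-case a-z is accepted alongside A-Z as an assumption, and the escaped-character and numeric-code forms mentioned in the description are deliberately not handled.

```java
// Hypothetical checker covering only the plain-character part of the CSS1 rule.
final class SelectorSketch {

    static boolean looksLikeCss1Selector(String buf) {
        if (buf == null || buf.isEmpty()) {
            return false;
        }
        for (int i = 0; i < buf.length(); i++) {
            char c = buf.charAt(i);
            boolean letter = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')   // assumption: case-insensitive A-Z
                    || (c >= 161 && c <= 255);  // Unicode characters 161-255
            boolean digit = c >= '0' && c <= '9';
            boolean dash = c == '-';
            if (!(letter || digit || dash)) {
                return false; // escapes and numeric codes are out of scope here
            }
            if (i == 0 && (digit || dash)) {
                return false; // cannot start with a dash or a digit
            }
        }
        return true;
    }
}
```

Under these assumptions `main-nav` is accepted, while `-nav`, `1col` and `a b` are rejected.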