Clan x86

Technical (Development, Security, etc.) => General Programming => Topic started by: Camel on September 22, 2007, 03:19:09 AM

Title: [Java] URL detection
Post by: Camel on September 22, 2007, 03:19:09 AM
Okay, so I've got this method:

http://bnubot.googlecode.com/svn/trunk/BNUBot/src/net/bnubot/bot/gui/components/TextWindow.java
private static Pattern pattern = null;
public String safeHtml(String in) {
    if(pattern == null)
        pattern = Pattern.compile("(.*)(\\b(http://|https://|www.|ftp://|file:/|mailto:)\\S+)(.*)");
    Matcher matcher = pattern.matcher(in);

    if(matcher.matches())
        return safeHtml(matcher.group(1))
            + "<a href=\"" + matcher.group(2) + "\">" + matcher.group(2) + "</a>"
            + safeHtml(matcher.group(4));
    return in
        .replaceAll("&", "&amp;")
        .replaceAll("<", "&lt;")
        .replaceAll(">", "&gt;")
        .replaceAll("\n", "<br>\n")
        .replaceAll("  ", " &nbsp;");
}


It works some of the time, but not always. In particular, it fails to find URLs that aren't the first thing in the input string, and it turns http://www... into http://<a href="...">www...</a>. I'm fairly confident the problem is in the regexp, but I'm not sure what's wrong or how to fix it.
Title: Re: [Java] URL detection
Post by: Sidoh on September 22, 2007, 02:07:14 PM
That's a pretty liberal regex for a URL, isn't it?  I mean it would match http://hi.hellotherehowareyoutoday (lol, so does SMF) or www.$%%%#@$%@#$@!#$@#$.

The http://<a href="..." problem is happening because you're matching http:// or www. and I don't think \b works in the way you're expecting it to.
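
To see the symptom in isolation, here is a minimal sketch (the class name GreedyPrefixDemo and the sample sentence are made up for illustration). It runs the original pattern next to a copy whose leading group is reluctant, (.*?). With the greedy (.*), the first group takes as much text as it can, so the URL group starts at the last position where \b plus a prefix still matches. There is a word boundary between "/" and "w", so that position is the www. right after http://.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyPrefixDemo {
    // The pattern from the first post, unchanged (greedy leading group).
    static final Pattern GREEDY = Pattern.compile(
        "(.*)(\\b(http://|https://|www.|ftp://|file:/|mailto:)\\S+)(.*)");

    // Identical except that the leading group is reluctant.
    static final Pattern RELUCTANT = Pattern.compile(
        "(.*?)(\\b(http://|https://|www.|ftp://|file:/|mailto:)\\S+)(.*)");

    // Returns what the pattern captures as the URL (group 2), or null if there is no match.
    static String url(Pattern p, String in) {
        Matcher m = p.matcher(in);
        return m.matches() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String in = "see http://www.google.com now";
        System.out.println(url(GREEDY, in));    // www.google.com
        System.out.println(url(RELUCTANT, in)); // http://www.google.com
    }
}
```

The reluctant version picks up the scheme because it tries the shortest possible lead-in first. It does nothing for input containing newlines, since "." still stops at a line break.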

Since I'm pretty rusty on regex, I tried to make one.  It seems to work pretty well too, but I haven't tested it very extensively.

Maybe this will help ya?

public static void main(String[] args)
{
    Pattern url = Pattern.compile(
        "(.*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)"
    );

    String[] tests = {
        "www.google.com/howareyoutoday/",
        "http://www.google.com/asdf",
        "http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new",
        "http://www.google.com/ig?hl=en",
        "wwhatwww.doyouwant.com",
        "www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search"
    };

    for(int i = 0; i < tests.length; i++)
    {
        Matcher match = url.matcher(tests[i]);

        if(match.matches())
        {
            System.out.println("Match");

            System.out.println(""
                + "<a href=\"" + (match.group(3).equals("www.") ? "http://" : "") + match.group(2) + "\">" +
                match.group(2) + "</a>");
        }
        else
        {
            System.out.println("No match");
        }
    }
}


Output

Match
<a href="http://www.google.com/howareyoutoday/">www.google.com/howareyoutoday/</a>
Match
<a href="http://www.google.com/asdf">http://www.google.com/asdf</a>
Match
<a href="http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new">http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new</a>
Match
<a href="http://www.google.com/ig?hl=en">http://www.google.com/ig?hl=en</a>
No match
Match
<a href="http://www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search">www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search</a>


Oh, this doesn't work for mailto: things, but I don't think it makes too much sense to look for those anyway.  Wouldn't it make more sense to have another regex that searched for email addresses?  In any case, it could be modified pretty easily to include it.
Title: Re: [Java] URL detection
Post by: Camel on September 23, 2007, 06:09:35 PM
Thanks Sidoh, that fixed the http prefix problem.

private static Pattern pattern = null;
public String safeHtml(String in) {
    if(pattern == null)
        pattern = Pattern.compile("(.*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)");
    Matcher matcher = pattern.matcher(in);

    if(matcher.matches())
        return safeHtml(matcher.group(1))
            + "<a href=\"" + matcher.group(2) + "\">" + matcher.group(2) + "</a>"
            + safeHtml(matcher.group(10));
    return in
        .replaceAll("&", "&amp;")
        .replaceAll("<", "&lt;")
        .replaceAll(">", "&gt;")
        .replaceAll("\n", "<br>\n")
        .replaceAll("  ", " &nbsp;");
}
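
The trailer moving from group(4) to group(10) follows from how java.util.regex numbers capturing groups: by opening parenthesis, left to right. The sketch below shows that count, then a variant that uses non-capturing (?:...) and named groups so the indices stop shifting whenever the pattern changes. Named groups need Java 7 or later. The class name and the group names (head, url, prefix, tail) are invented for illustration and are not from the bot's source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // The pattern from the post above, unchanged: ten capturing groups.
    static final Pattern NUMBERED = Pattern.compile(
        "(.*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)");

    // Same structure, but only the parts that get read back are captured, and by name.
    static final Pattern NAMED = Pattern.compile(
        "(?<head>.*?)\\b(?<url>(?<prefix>[a-zA-Z]{3,6}://|www.)(?:[a-zA-Z0-9-.]+)"
        + "(?:[^-]\\.[a-zA-Z]{2,5})(?:/\\S+|\\s*?))(?<tail>.*)");

    public static void main(String[] args) {
        String in = "see www.google.com/asdf ok";

        Matcher n = NUMBERED.matcher(in);
        if (n.matches()) {
            System.out.println(n.groupCount());          // 10
            System.out.println(n.group(2));              // www.google.com/asdf
            System.out.println("[" + n.group(10) + "]"); // [ ok]
        }

        Matcher m = NAMED.matcher(in);
        if (m.matches()) {
            System.out.println(m.groupCount());              // 4
            System.out.println(m.group("url"));              // www.google.com/asdf
            System.out.println("[" + m.group("tail") + "]"); // [ ok]
        }
    }
}
```

With names, wrapping the lead-in in another pair of parentheses later on would not change which group holds the URL or the trailer.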


It still isn't picking URLs out of my MOTD string, though. I'll have to do some more debugging, but I think it has something to do with a newline immediately prefixing http://.
Title: Re: [Java] URL detection
Post by: Sidoh on September 23, 2007, 06:49:06 PM
Oh, that's right.  Good call.  I totally forgot "." is everything but newline and I don't think \w (which is referenced by \b) matches linebreaks.  I might be wrong on that, but messing around with things seems to suggest it.

Anyway, this little fix should take care of that:


...
"\nwww.google.com/howareyoutoday/",
...
Pattern url = Pattern.compile(
  "((.|\n)*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)"
);
...
System.out.println(""
+ "<a href=\"" + (match.group(4).equals("www.") ? "http://" : "") + match.group(3) + "\">" +
match.group(3) + "</a>");
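
As an alternative to spelling out (.|\n), the Pattern.DOTALL flag (or an inline (?s)) makes "." match line terminators too. The sketch below compares the two modes on a made-up MOTD-style string. The URL part is deliberately cut down to a www. prefix to keep the demo short, so treat it as an illustration of the flag rather than a replacement pattern. One reason to prefer the flag: (.|\n)* can run into a StackOverflowError in java.util.regex on very long inputs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotNewlineDemo {
    // Simplified URL part (www. prefix only); an assumption made for brevity.
    static final String REGEX = "(.*?)\\b(www\\.\\S+)(.*)";

    static final Pattern PLAIN = Pattern.compile(REGEX);
    static final Pattern DOTALL = Pattern.compile(REGEX, Pattern.DOTALL);

    public static void main(String[] args) {
        String motd = "Welcome!\nwww.google.com/howareyoutoday/ enjoy\nyour stay";

        // Without DOTALL the lead-in group cannot cross the first newline.
        System.out.println(PLAIN.matcher(motd).matches()); // false

        Matcher m = DOTALL.matcher(motd);
        if (m.matches()) {
            System.out.println(m.group(2)); // www.google.com/howareyoutoday/
        }
    }
}
```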


There might be something else
Title: Re: [Java] URL detection
Post by: Camel on September 23, 2007, 07:15:53 PM
Cool, that works. I did the same thing to the trailer, as well. It still misses trailing slashes, but I don't care since it still works.

private static Pattern pattern = null;
public String safeHtml(String in) {
    if(pattern == null)
        pattern = Pattern.compile("((.|\n)*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))((.|\n)*)");
    Matcher matcher = pattern.matcher(in);

    if(matcher.matches())
        return safeHtml(matcher.group(1))
            + "<a href=\"" + matcher.group(3) + "\">" + matcher.group(3) + "</a>"
            + safeHtml(matcher.group(11));
    return in
        .replaceAll("&", "&amp;")
        .replaceAll("<", "&lt;")
        .replaceAll(">", "&gt;")
        .replaceAll("\n", "<br>\n")
        .replaceAll("  ", " &nbsp;");
}
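
For comparison, a different technique from the recursive matches() approach above: scan with Matcher.find() and escape the text between hits. Because find() locates each URL wherever it sits, the pattern needs no lead-in or trailer groups and newlines stop being a special case. This is a rough sketch, not the bot's code. The class name LinkifySketch, the helpers escape and text, and the cut-down URL pattern (scheme:// or www. followed by non-whitespace) are all assumptions. It also prepends http:// to www. links as in Sidoh's test program, and escapes the URL itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkifySketch {
    // Simplified heuristic: a scheme:// or www. prefix, then anything up to whitespace.
    static final Pattern URL = Pattern.compile("\\b(?:[a-zA-Z]{3,6}://|www\\.)\\S+");

    // Escapes the characters that are special in HTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Escapes plain text and keeps line breaks and double spaces visible, as in the thread.
    static String text(String s) {
        return escape(s).replace("\n", "<br>\n").replace("  ", " &nbsp;");
    }

    static String safeHtml(String in) {
        StringBuilder out = new StringBuilder();
        Matcher m = URL.matcher(in);
        int last = 0; // end of the previous match
        while (m.find()) {
            out.append(text(in.substring(last, m.start())));
            String url = m.group();
            String href = url.startsWith("www.") ? "http://" + url : url;
            out.append("<a href=\"").append(escape(href)).append("\">")
               .append(escape(url)).append("</a>");
            last = m.end();
        }
        out.append(text(in.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(safeHtml("www.google.com"));
        // <a href="http://www.google.com">www.google.com</a>
    }
}
```

The pattern here is looser than the one in the thread (it does no domain or TLD check and will swallow trailing punctuation), so the thread's URL expression could be dropped in as the URL constant if that strictness matters.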
Title: Re: [Java] URL detection
Post by: Joe on September 24, 2007, 05:59:18 PM
"wwhatwww.doyouwant.com",

That's totally a valid URL.
Title: Re: [Java] URL detection
Post by: Sidoh on September 24, 2007, 06:19:23 PM
Quote from: Joe on September 24, 2007, 05:59:18 PM
"wwhatwww.doyouwant.com",

That's totally a valid URL.

Yes, but the idea is that you ignore strings that aren't almost certainly URLs (meaning they're prefixed with one of www. or a protocol://).

If you were building a regular expression to validate URLs, then you'd want to include ones like that, but you'd also want to be much more conservative, explicitly specifying protocols, .extensions, etc.
Title: Re: [Java] URL detection
Post by: Camel on September 24, 2007, 07:22:36 PM
Quote from: Joe on September 24, 2007, 05:59:18 PM
"wwhatwww.doyouwant.com",

That's totally a valid URL.

Just because your browser's address bar knows what you mean, you can't conclude that it's a URL.

That is not, in fact, a URL, because it does not give you enough information to know what service it provides, or how to connect to it.
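
One way to check that point in code is java.net.URI, which parses a bare host name as a relative reference with no scheme. A small sketch follows. The helper name hasScheme is hypothetical, and "has a scheme" is only a rough stand-in for "tells you what service it is and how to connect".

```java
import java.net.URI;

class SchemeCheckDemo {
    // True when the string parses as a URI and carries a scheme (http, ftp, mailto, ...).
    static boolean hasScheme(String s) {
        try {
            return URI.create(s).getScheme() != null;
        } catch (IllegalArgumentException e) {
            return false; // not parseable as a URI at all
        }
    }

    public static void main(String[] args) {
        System.out.println(hasScheme("wwhatwww.doyouwant.com"));        // false
        System.out.println(hasScheme("www.google.com"));                // false
        System.out.println(hasScheme("http://wwhatwww.doyouwant.com")); // true
    }
}
```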