[Java] URL detection

Camel · September 22, 2007, 03:19:09 AM

Okay, so I've got this ~~function~~ method:

http://bnubot.googlecode.com/svn/trunk/BNUBot/src/net/bnubot/bot/gui/components/TextWindow.java

	private static Pattern pattern = null;
	public String safeHtml(String in) {
		if(pattern == null)
			pattern = Pattern.compile("(.*)(\\b(http://|https://|www.|ftp://|file:/|mailto:)\\S+)(.*)");
		Matcher matcher = pattern.matcher(in); 
		
		if(matcher.matches())
			return safeHtml(matcher.group(1))
				+ "<a href=\"" + matcher.group(2) + "\">" + matcher.group(2) + "</a>"
				+ safeHtml(matcher.group(4));
		return in
			.replaceAll("&", "&amp;")
			.replaceAll("<", "&lt;")
			.replaceAll(">", "&gt;")
			.replaceAll("\n", "<br>\n")
			.replaceAll("  ", " &nbsp;");
	}

And it seems to work alright some of the time, but not all of the time. In particular, it doesn't like to find URLs that aren't the first thing in the input string, and it likes to turn http://www... into http://<a href="...">www...</a>. I'm fairly confident that it's an issue with the regexp, but I'm not exactly sure what's wrong or how to fix it.

Sidoh · September 22, 2007, 02:07:14 PM

That's a pretty liberal regex for a URL, isn't it? I mean it would match http://hi.hellotherehowareyoutoday (lol, so does SMF) or www.$%%%#@$%@#$@!#$@#$.

The http://<a href="..." problem is happening because you're matching http:// or www. and I don't think \b works in the way you're expecting it to.
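To make the cause concrete, here is a small check of what the original pattern captures. The class name GreedyDemo, the split helper and the sample string are made up for illustration. Because the leading (.*) is greedy, it gives back as little as possible, so the URL group starts at the last position where one of the prefixes fits. In http://www... that is the www., since \b also holds between "/" and "w".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyDemo {
	// The pattern from the first post, unchanged.
	static final Pattern ORIGINAL = Pattern.compile(
		"(.*)(\\b(http://|https://|www.|ftp://|file:/|mailto:)\\S+)(.*)");

	// Returns {head, url, tail} as captured by the original pattern,
	// or null if the whole input does not match.
	public static String[] split(String in) {
		Matcher m = ORIGINAL.matcher(in);
		if (!m.matches())
			return null;
		return new String[] { m.group(1), m.group(2), m.group(4) };
	}

	public static void main(String[] args) {
		String[] parts = split("see http://www.google.com now");
		System.out.println("head=[" + parts[0] + "]"); // head=[see http://]
		System.out.println("url =[" + parts[1] + "]"); // url =[www.google.com]
		System.out.println("tail=[" + parts[2] + "]"); // tail=[ now]
	}
}
```

A reluctant head group, (.*?), prefers the leftmost candidate instead, so the match starts at http:// rather than at www.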

Since I'm pretty rusty on regex, I tried to make one. It seems to work pretty well too, but I haven't tested it very extensively.

Maybe this will help ya?

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Wrapper class so the snippet compiles on its own; the name is arbitrary.
public class UrlTest {
	public static void main(String[] args)
	{
		// Group 2 is the whole URL; group 3 is its prefix ("www." or "protocol://").
		Pattern url = Pattern.compile(
		  "(.*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)"
		);
		
		String[] tests = {
			"www.google.com/howareyoutoday/",
			"http://www.google.com/asdf",
			"http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new",
			"http://www.google.com/ig?hl=en",
			"wwhatwww.doyouwant.com",
			"www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search"
		};
		
		for(int i = 0; i < tests.length; i++)
		{
			Matcher match = url.matcher(tests[i]);
			
			if(match.matches())
			{
				System.out.println("Match");
				
				// Prepend http:// to the href when the URL only starts with www.
				System.out.println(""
					+ "<a href=\"" + (match.group(3).equals("www.") ? "http://" : "") + match.group(2) + "\">" + 
					match.group(2) + "</a>");
			}
			else
			{
				System.out.println("No match");
			}
		}
	}
}

Output

Match
<a href="http://www.google.com/howareyoutoday/">www.google.com/howareyoutoday/</a>
Match
<a href="http://www.google.com/asdf">http://www.google.com/asdf</a>
Match
<a href="http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new">http://www.x86labs.org/forum/index.php/topic,10305.msg130809.html#new</a>
Match
<a href="http://www.google.com/ig?hl=en">http://www.google.com/ig?hl=en</a>
No match
Match
<a href="http://www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search">www.google.com/search?source=ig&hl=en&q=regex&btnG=Google+Search</a>

Oh, this doesn't work for mailto: things, but I don't think it makes too much sense to look for those anyway. Wouldn't it make more sense to have another regex that searched for email addresses? In any case, it could be modified pretty easily to include it.
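As a rough idea of what a separate email pass could look like, here is a minimal sketch. The class name EmailLinks, the linkEmails helper and the pattern itself are assumptions for illustration; the pattern is deliberately loose and is not a full address validator. It also does no HTML escaping, so it would have to be combined with the escaping in safeHtml.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailLinks {
	// Illustrative pattern: local part, "@", dotted domain, 2-5 letter TLD.
	static final Pattern EMAIL = Pattern.compile(
		"\\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,5}\\b");

	// Wraps every address found in the input in a mailto: link.
	public static String linkEmails(String in) {
		Matcher m = EMAIL.matcher(in);
		StringBuilder out = new StringBuilder();
		int last = 0;
		while (m.find()) {
			out.append(in, last, m.start()); // text before the address, untouched
			out.append("<a href=\"mailto:").append(m.group()).append("\">")
				.append(m.group()).append("</a>");
			last = m.end();
		}
		out.append(in, last, in.length()); // remaining text after the last address
		return out.toString();
	}

	public static void main(String[] args) {
		System.out.println(linkEmails("mail me at joe@example.com please"));
	}
}
```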

Camel · September 23, 2007, 06:09:35 PM

Thanks Sidoh, that fixed the http prefix problem.

	private static Pattern pattern = null;
	public String safeHtml(String in) {
		if(pattern == null)
			pattern = Pattern.compile("(.*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)");
		Matcher matcher = pattern.matcher(in); 
		
		if(matcher.matches())
			return safeHtml(matcher.group(1))
				+ "<a href=\"" + matcher.group(2) + "\">" + matcher.group(2) + "</a>"
				+ safeHtml(matcher.group(10));
		return in
			.replaceAll("&", "&amp;")
			.replaceAll("<", "&lt;")
			.replaceAll(">", "&gt;")
			.replaceAll("\n", "<br>\n")
			.replaceAll("  ", " &nbsp;");
	}

It still isn't picking URLs out of my MOTD string, though. I'll have to do some more debugging, but I think it has something to do with a newline immediately prefixing http://.

Sidoh · September 23, 2007, 06:49:06 PM

Oh, that's right. Good call. I totally forgot "." is everything but newline and I don't think \w (which is referenced by \b) matches linebreaks. I might be wrong on that, but messing around with things seems to suggest it.

Anyway, this little fix should take care of that:


...
			"\nwww.google.com/howareyoutoday/",
...
		Pattern url = Pattern.compile(
		  "((.|\n)*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))(.*)"
		);
...
				System.out.println(""
					+ "<a href=\"" + (match.group(4).equals("www.") ? "http://" : "") + match.group(3) + "\">" + 
					match.group(3) + "</a>");

There might be something else.
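The newline behaviour can be checked in isolation. The sketch below uses a cut-down pattern and a made-up class name (NewlineDemo), both assumptions for illustration. It suggests that \b is not the culprit: a newline is a non-word character, so there is a word boundary between "\n" and "w". What fails is the leading group, because "." stops at line terminators unless Pattern.DOTALL is set.

```java
import java.util.regex.Pattern;

public class NewlineDemo {
	// Cut-down stand-in for the thread's pattern: head, a www. URL, tail.
	static final String REGEX = "(.*?)\\bwww\\.\\S+(.*)";

	// Default flags: "." does not match line terminators.
	public static boolean matchesDefault(String in) {
		return Pattern.compile(REGEX).matcher(in).matches();
	}

	// DOTALL: "." matches every character, line terminators included.
	public static boolean matchesDotall(String in) {
		return Pattern.compile(REGEX, Pattern.DOTALL).matcher(in).matches();
	}

	// find() only asks whether \bwww occurs somewhere in the input.
	public static boolean boundaryBeforeWww(String in) {
		return Pattern.compile("\\bwww").matcher(in).find();
	}

	public static void main(String[] args) {
		String in = "\nwww.google.com/howareyoutoday/";
		System.out.println(matchesDefault(in));    // false
		System.out.println(matchesDotall(in));     // true
		System.out.println(boundaryBeforeWww(in)); // true
	}
}
```

Pattern.DOTALL is an alternative to writing (.|\n). It also covers "\r", which (.|\n) does not, so it may behave better on text with Windows line endings.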

Camel · September 23, 2007, 07:15:53 PM

Cool, that works. I did the same thing to the trailer, as well. It still misses trailing slashes, but I don't care since it still works.

	private static Pattern pattern = null;
	public String safeHtml(String in) {
		if(pattern == null)
			pattern = Pattern.compile("((.|\n)*?)\\b((([a-zA-Z]{3,6}://)|(www.)){1}([a-zA-Z0-9-.]+)([^-]\\.[a-zA-Z]{2,5}){1}((/\\S+){1}|\\s*?))((.|\n)*)");
		Matcher matcher = pattern.matcher(in); 
		
		if(matcher.matches())
			return safeHtml(matcher.group(1))
				+ "<a href=\"" + matcher.group(3) + "\">" + matcher.group(3) + "</a>"
				+ safeHtml(matcher.group(11));
		return in
			.replaceAll("&", "&amp;")
			.replaceAll("<", "&lt;")
			.replaceAll(">", "&gt;")
			.replaceAll("\n", "<br>\n")
			.replaceAll("  ", " &nbsp;");
	}
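The missed trailing slash looks like a consequence of /\S+, which needs at least one character after the slash, so a slash directly after the host falls into the trailer group. Below is one possible variation, not the code used in BNUBot. It walks the input with Matcher.find() instead of recursing, uses /\S* for the path, escapes the dot in www\. and escapes the URL before putting it into the markup. The class name Linkify, the escape helper and the simplified pattern are all assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Linkify {
	// Simplified URL shape (an assumption, not the thread's exact pattern):
	// "protocol://" or "www.", a dotted host, a 2-5 letter TLD, optional path.
	private static final Pattern URL = Pattern.compile(
		"\\b(?:[a-zA-Z]{3,6}://|www\\.)[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,5}(?:/\\S*)?");

	// Same replacements as safeHtml in the thread, using literal replace().
	static String escape(String s) {
		return s.replace("&", "&amp;")
			.replace("<", "&lt;")
			.replace(">", "&gt;")
			.replace("\n", "<br>\n")
			.replace("  ", " &nbsp;");
	}

	public static String safeHtml(String in) {
		StringBuilder out = new StringBuilder();
		Matcher m = URL.matcher(in);
		int last = 0;
		while (m.find()) {
			// Plain text between the previous URL and this one.
			out.append(escape(in.substring(last, m.start())));
			String url = m.group();
			// Give bare www. addresses a protocol in the href only.
			String href = url.startsWith("www.") ? "http://" + url : url;
			out.append("<a href=\"")
				.append(escape(href).replace("\"", "&quot;"))
				.append("\">").append(escape(url)).append("</a>");
			last = m.end();
		}
		out.append(escape(in.substring(last))); // text after the last URL
		return out.toString();
	}

	public static void main(String[] args) {
		System.out.println(safeHtml("motd:\nhttp://www.google.com/ <b>"));
	}
}
```

Since the pattern no longer has to span the whole input, newlines before or after a URL need no special handling here.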

Joe · September 24, 2007, 05:59:18 PM

"wwhatwww.doyouwant.com",

That's totally a valid URL.

Sidoh · September 24, 2007, 06:19:23 PM

Quote from: Joe
"wwhatwww.doyouwant.com",

That's totally a valid URL.

Yes, but the idea is that you ignore strings that aren't almost certainly URLs (meaning they're prefixed with one of www. or a protocol://).

If you were building a regular expression to validate URLs, then you'd want to include ones like that, but you'd also want to be much more conservative, explicitly specifying protocols, .extensions, etc.
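A validating pattern along those lines might look like the sketch below. The class name StrictUrl and the protocol and TLD lists are illustrative assumptions, not a recommended whitelist. The protocol is optional, so a bare host name passes, but anything that is present has to come from the explicit lists.

```java
import java.util.regex.Pattern;

public class StrictUrl {
	private static final Pattern STRICT = Pattern.compile(
		"(?:(?:https?|ftp)://)?"                               // optional, but only these protocols
		+ "(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\\.)+"   // host labels, no leading or trailing hyphen
		+ "(?:com|net|org|edu)"                                // explicit TLDs (illustrative list)
		+ "(?:/\\S*)?");                                       // optional path

	// matches() requires the whole string to be a URL, not just contain one.
	public static boolean isValid(String s) {
		return STRICT.matcher(s).matches();
	}

	public static void main(String[] args) {
		System.out.println(isValid("http://www.google.com/ig?hl=en"));      // true
		System.out.println(isValid("wwhatwww.doyouwant.com"));              // true
		System.out.println(isValid("http://hi.hellotherehowareyoutoday"));  // false
		System.out.println(isValid("www.$%%%#@$%@#$@!#$@#$"));              // false
	}
}
```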

Camel · September 24, 2007, 07:22:36 PM

Quote from: Joe
"wwhatwww.doyouwant.com",

That's totally a valid URL.

Just because your browser's address bar knows what you mean, you can't conclude that it's a URL.

That is not, in fact, a URL, because it does not give you enough information to know what service it provides, or how to connect to it.
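One way to see the missing piece is to ask java.net.URI for the scheme. This is a small sketch with a made-up class name (SchemeCheck). The string parses as a relative reference with no scheme, so nothing says which protocol to use.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class SchemeCheck {
	// Returns the scheme ("http", "ftp", ...) or null when the string
	// has none or cannot be parsed as a URI at all.
	public static String schemeOf(String s) {
		try {
			return new URI(s).getScheme();
		} catch (URISyntaxException e) {
			return null;
		}
	}

	public static void main(String[] args) {
		System.out.println(schemeOf("wwhatwww.doyouwant.com"));        // null
		System.out.println(schemeOf("http://wwhatwww.doyouwant.com")); // http
	}
}
```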
