Tag Archive for 'parser'

Groovy : Parse All Soccer Players Info

I am new to Groovy and, coming from Java, am still getting used to the scripting way of doing things. So as a learning exercise I wrote the following lines to parse information on all the soccer players from ESPN Soccernet. I used the Jsoup library to fetch each document and parse it.

import org.jsoup.nodes.Element

def leagues = [
        "http://soccernet.espn.go.com/clubs/_/league/eng.1/english-premier-league?cc=4716",
        "http://soccernet.espn.go.com/clubs/_/league/esp.1/spanish-la-liga?cc=4716",
        "http://soccernet.espn.go.com/clubs/_/league/ita.1/italian-serie-a?cc=4716",
        "http://soccernet.espn.go.com/clubs/_/league/ger.1/german-bundesliga?cc=4716",
        "http://soccernet.espn.go.com/clubs/_/league/fra.1/french-ligue-1?cc=4716",
]

leagues.each {leagueUrl ->
    Utils.getDocument(leagueUrl).select("table[class=tablehead]").get(0).select("td:eq(2)").select("a[href]").each {teamStatsUrl ->
        Utils.getDocument(teamStatsUrl.attr("abs:href")).select("tbody").each {playerGroup ->
            playerGroup.select("td:eq(1)").select("a[href]").each {playerLink ->
                Element playerProfile = Utils.getDocument(playerLink.attr("abs:href")).select("div.profile").get(0)
                String playerName = playerProfile.select("h1").text()

                def profileProperties = [:]
                playerProfile.select("li").each {item ->
                    // Split on the first colon only, so values that contain ":" stay intact
                    String[] itemProperties = item.text().split(":", 2)
                    // Items without a colon are team names; collect them under "teams"
                    if(itemProperties.size() == 1) profileProperties.get("teams", []).add(itemProperties[0])
                    else profileProperties[itemProperties[0]] = itemProperties[1]
                }
                println playerName + " " + profileProperties
            }
        }
    }
}

All that Utils.getDocument(url) does here is call Jsoup.connect(url).get() inside a loop, with the number of retries set to 5. The script produces output as follows:

Ramires [Full Name: Ramires, Squad No: 7, Position: Midfielder, Age: 24, Birth Date: Mar 24, 1987, Birth Place: Barra do Piraí, Rio de Janeiro, Brazil, Height: 5' 11'' (1.80m), Weight: 73 kg, teams:[Brazil, Chelsea]]
Frank Lampard [Squad No: 8, Position: Midfielder, Age: 33, Birth Date: Jun 21, 1978, Birth Place: Romford, Height: 6' 0" (1.83m), Weight: 174 lbs (78.7 kg), teams:[England, Chelsea]]
Fernando Torres [Full Name: Fernando Torres, Squad No: 9, Position: Forward, Age: 27, Birth Date: Mar 20, 1984, Birth Place: Fuenlabrada, Madrid, Height: 6' 1'' (1.85m), Weight: 174 lbs (78.7 kg), teams:[Spain, Chelsea]]
John Mikel Obi [Squad No: 12, Position: Midfielder, Age: 24, Birth Date: Apr 22, 1987, Birth Place: Jos, Nigeria, Height: 5' 11'' (1.80m), Weight: 179 lbs (81.3 kg), teams:[Chelsea]]
Raul Meireles [Full Name: Raul Meireles, Squad No: 16, Position: Midfielder, Age: 28, Birth Date: Mar 17, 1983, Birth Place: Porto, Portugal, Height: 1.79m, Weight: 65 kg, teams:[Chelsea, Liverpool, Portugal]]
Branislav Ivanovic [Squad No: 2, Position: Defender, Age: 27, Birth Date: Feb 22, 1984, Birth Place: Sremska Mitrovica, Yugoslavia, Height: 6' 2" (1.88m), Weight: 86 kg, teams:[Serbia, Chelsea]]
Juan Mata [Full Name: Juan Mata, Squad No: 10, Position: Forward, Age: 23, Birth Date: Apr 28, 1988, Birth Place: Burgos, Spain, Height: 1.70m, Weight: 61 kg, teams:[Spain, Valencia, Chelsea, Spain U21]]
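
The retry loop inside Utils.getDocument is not shown in this post, so here is a minimal sketch of the idea in plain Java. The names RetrySketch and withRetries are hypothetical, and the action is passed in as a Callable so that the example runs without Jsoup. In the script above the action would be the call to Jsoup.connect(url).get().

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not the Utils class from the post:
// retries an action a fixed number of times before giving up.
class RetrySketch {

    // Runs the action until it succeeds or maxAttempts attempts have failed.
    static <T> T withRetries(int maxAttempts, Callable<T> action) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                lastFailure = e; // remember the failure and try again
            }
        }
        throw new IllegalStateException("Gave up after " + maxAttempts + " attempts", lastFailure);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in for a flaky fetch: fails twice, then succeeds.
        String result = withRetries(5, () -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new java.io.IOException("simulated network failure");
            }
            return "document";
        });
        System.out.println(result + " after " + calls[0] + " calls");
    }
}
```

A real version would probably also pause between attempts; the post does not say whether the original does.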


Parsing XML Using Castor

Castor provides three ways to parse an XML file into Java objects. I have used two of those, and am writing this post to give an introduction to them.

The first way is to create Java classes for each of the elements and then use the marshal and unmarshal methods to parse the XML file. Castor uses introspection to map elements to fields of the Java class. I have not used this method, so I will not go any further into it.

The second and perhaps the easiest way is to give Castor the schema file and let it generate the java classes. Say you have the following schema file:

If you let Castor generate the java source files itself, the following files are generated: The Plgen class for the root element:

The Playlist class which is the child of Plgen:

And finally, the ExtensionType enum:

But I ran into trouble when I wanted to add additional behavior to the classes generated by Castor. As an example, say I want my Playlist class to implement an interface XYZ which has a method doXYZprocess(). If I modify the generated Playlist class to implement that interface, then any future change to the schema file means regenerating the class, and I’ll have to redo all my additions. Now this is a painful task.

An alternative was to create wrapper classes around the Castor-generated classes and use the wrappers instead. Well, this would work, but I wanted something smarter than such duplication.
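
To make the duplication concrete, here is a rough sketch of what such a wrapper might look like. GeneratedPlaylist is a hypothetical stand-in for a Castor-generated class, its name field is invented for illustration, and the void return type of doXYZprocess() is an assumption.

```java
// Hypothetical stand-in for a Castor-generated class; the real one
// would be regenerated whenever the schema changes.
class GeneratedPlaylist {
    private String name; // invented field, for illustration only

    public String getName() { return name; }

    public void setName(String name) { this.name = name; }
}

// The interface the post wants Playlist to implement (return type assumed).
interface XYZ {
    void doXYZprocess();
}

// Wrapper: adds behaviour without touching generated code,
// at the cost of re-exposing every accessor by hand.
class PlaylistWrapper implements XYZ {
    private final GeneratedPlaylist delegate;
    private boolean processed = false;

    PlaylistWrapper(GeneratedPlaylist delegate) {
        this.delegate = delegate;
    }

    // Duplicated accessor: one of these is needed per generated field.
    public String getName() { return delegate.getName(); }

    @Override
    public void doXYZprocess() {
        processed = true; // placeholder for the real extra behaviour
    }

    public boolean isProcessed() { return processed; }
}
```

Every field added to the schema means another pass-through method in the wrapper, which is the duplication I wanted to avoid.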

This is where the third method of parsing comes into the picture. It addresses exactly the issue mentioned above. The idea is to use user-written classes for marshalling rather than Castor-generated ones. One could write a class having lots of methods and fields, bind it to an element of an XML file and have it populated by Castor while parsing the XML. Castor provides this functionality through a Mapping.

A Mapping is another XML file which tells Castor which Java classes and fields are to be used for parsing. The mapping format gives you a great degree of freedom. One could specifically tell Castor to use certain getters and setters, or leave it to Castor to populate the class fields by introspection.

An example of a mapping file for the above schema file would be:

Castor interprets this mapping as the root element being plgen which would get mapped to the class Plgen. This root element can contain a list of Playlist elements. So the getter and setter that Castor calls will be

public ArrayList<Playlist> getPlaylist();

public void setPlaylist(ArrayList<Playlist> playlists);

So our Plgen class above needs to contain these two methods. Similarly, the Playlist class needs to have the following two methods to set the extensionType field:

public String getExtensionType();

public void setExtensionType(String extensionType);

Note that I am taking the enum value as a string and will use the setter to set the Enum from a string. My getter likewise will return the Enum.toString().
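
Put together, the two classes might look roughly like the sketch below. The enum constants M3U and PLS are hypothetical (the real values come from the schema), and extensionTypeAsEnum() is an extra accessor added here for illustration; Castor itself would only call the getters and setters named above.

```java
import java.util.ArrayList;

// Hypothetical constants: the real ones are defined by the schema.
enum ExtensionType { M3U, PLS }

class Playlist {
    private ExtensionType extensionType;

    // Castor would call this with the text from the XML; convert it to the enum here.
    // Enum.valueOf throws IllegalArgumentException for an unknown value.
    public void setExtensionType(String extensionType) {
        this.extensionType = ExtensionType.valueOf(extensionType);
    }

    // Castor would call this when marshalling; hand back the enum's string form.
    public String getExtensionType() {
        return extensionType == null ? null : extensionType.toString();
    }

    // Typed accessor for the rest of the application (not used by Castor).
    public ExtensionType extensionTypeAsEnum() {
        return extensionType;
    }
}

class Plgen {
    private ArrayList<Playlist> playlist = new ArrayList<>();

    public ArrayList<Playlist> getPlaylist() { return playlist; }

    public void setPlaylist(ArrayList<Playlist> playlists) { this.playlist = playlists; }
}

class MappingTargetsSketch {
    public static void main(String[] args) {
        Playlist p = new Playlist();
        p.setExtensionType("M3U");
        System.out.println(p.getExtensionType()); // prints M3U
    }
}
```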

If you do not want Castor to use the default getters and setters, then you have to specify the getter and setter through the get-method and set-method attributes of the field element in the mapping.

Castor has compiled a help page to understand the mapping option. You may like to read it here:

Castor XML Mapping

I really liked the mapping feature. It gives me a lot of freedom to write Java classes and then use the mapping to parse any xml file into these classes. This obviates the need to understand the DOM parsing model, of iterating through the xml tree and extracting the values required. The game has been simplified to writing a mapping file and letting Castor create the java objects.

I hate XMLs NO more. :)


Parsing XML With Castor XML

After a lot of trying I finally managed to get the Castor tools working to parse XML files. And now that I have it working for me, I am always going to use it for XML parsing. It makes things so much simpler and easier.

Castor takes in an XML file and unmarshals it into Java objects. There are three ways to associate Java classes with XML elements.

  • The first one is introspection. Given the class, the Unmarshaller populates the instance fields from the XML.
  • The second is to use bindings defined by user.
  • The third is to use the XML Code Generator tool and have it generate Java Classes.

Of course, I used the third option, and that is the one I am going to mention here.

Castor jars can be downloaded from the Castor Project. It has a few dependencies, so I decided to use Maven. And I am already using Eclipse.

I imported the castor-code-generation jar. Then created a pom.xml file. I added a plugin to it:

      <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>castor-maven-plugin</artifactId>
        <version>2.0</version>
        <configuration>
          <schema>config.xsd</schema>
          <packaging>pba.plgen.xml.binding</packaging>
          <properties>generation.properties</properties>
        </configuration>
        <executions>
          <execution>
            <goals>
              <goal>generate</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

Where the properties file is:

# Specifies whether the sources generated should be source compatible with
# Java 1.4 or Java 5.0. Legal values are "1.4" and "5.0".  When "5.0" is
# selected, generated source will use Java 5 features such as generics and
# annotations.
# Defaults to "5.0".
#
org.exolab.castor.builder.javaVersion=5.0

# Set to true if you want to have an equals() and
# hashCode() method generated for each generated class;
# false by default
org.exolab.castor.builder.equalsmethod=true

# Specifies whether automatic class name conflict resolution
# should be used or not; defaults to false.
org.exolab.castor.builder.automaticConflictResolution=true

# Property specifying whether extra members/methods for extracting XML schema
# documentation should be made available; defaults to false
org.exolab.castor.builder.extraDocumentationMethods=false

Right Click -> Run As -> Maven generate-sources

And all the source code is generated.

Suppose that the root element is plgen. Then a class Plgen is generated. All the child nodes of plgen become instance variables of Plgen. To unmarshal the XML file:

    public void parse() {
        try {
            m_plgen = Plgen.unmarshal(new FileReader(m_configFilePath));
        } catch (MarshalException e) {
            e.printStackTrace();
        } catch (ValidationException e) {
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }

Well, I haven’t written this properly. But it’s enough to make me remember how to get Castor working again. Mission accomplished.


I Hate XMLs

XMLs have never been my favourite. Attributes, Elements, Values, Child nodes … aargh, they are so confusing. And yet, configuring an application using an xml file seems so simple.

I have written my own XML parsers using the DOM. The whole document gets mapped to a tree structure which you can iterate over and get any value you want. But I find it to be very restrictive, and most of my code depends on the structure of my xml file. I would like to make my xml parser independent of the xml file structure.
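
To show what I mean by code that depends on the structure, here is a small sketch using the JDK's built-in DOM parser. The element name playlist and the attribute name are made up for the example; rename either one in the XML and this code silently returns nothing useful.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomWalkSketch {

    // Collects the "name" attribute of every <playlist> descendant of the root element.
    // Both names are hard-coded, which is what ties the code to the file layout.
    static List<String> playlistNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getDocumentElement().getElementsByTagName("playlist");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element playlist = (Element) nodes.item(i);
                names.add(playlist.getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            // Wrap parser and I/O errors so callers need no checked-exception handling.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<plgen><playlist name=\"rock\"/><playlist name=\"jazz\"/></plgen>";
        System.out.println(playlistNames(xml)); // prints [rock, jazz]
    }
}
```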

I came across Castor XML. It has tools to read the schema and generate Java Classes out of it. Then the xml file is parsed and objects are created. Now this is good. I think this should be much easier than iterating a tree. But !! I have spent a lot of time trying to get Castor working. Its dependencies !!! Damn.

Time to get back to xml parsing again.
