Isolating desired content, metadata, or both from social media

United States Patent

August 7, 2012
Pacific Northwest National Laboratory
Desired content, metadata, or both can be isolated from the full content of social media websites having content-rich pages. Achieving this can include obtaining from the content-rich pages a language-independent representation having a hierarchical structure of nodes and then generating a node representation for each node. Feature vectors for the nodes are generated and a label is assigned to each node representation according to a schema. Assignment can occur by executing a trained classification algorithm on the feature vectors. The schema has schema elements and each schema element corresponds to a label. For each schema element, all node representations having matching labels are gathered and then one node representation is elected from among those with matching labels to be assigned to a schema element field in a template. The template can be applied to extract desired content, metadata, or both according to the schema from all the content-rich pages.
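The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The schema (`title`, `author`, `body`), the four features, the sample pages, and every function name are hypothetical, and the rule-based `classify` function is only a stand-in for the trained classification algorithm the abstract refers to. The election step here picks the node path with the highest summed score per label, which is one plausible reading of "elected", not a detail stated in the abstract.

```python
from collections import defaultdict
from html.parser import HTMLParser

SCHEMA = ("title", "author", "body")  # hypothetical schema elements, one label each


class NodeCollector(HTMLParser):
    """Walks the HTML hierarchy and records one node representation per element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, root first
        self.nodes = []   # every node representation seen so far

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class") or ""
        name = f"{tag}.{cls}" if cls else tag
        path = "/".join([n["name"] for n in self.stack] + [name])
        node = {"tag": tag, "cls": cls, "name": name, "path": path, "text": ""}
        self.stack.append(node)
        self.nodes.append(node)

    def handle_endtag(self, tag):
        # Close elements up to the matching open tag; ignore stray end tags.
        if any(n["tag"] == tag for n in self.stack):
            while self.stack.pop()["tag"] != tag:
                pass

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"] += data.strip()


def parse(html):
    collector = NodeCollector()
    collector.feed(html)
    return collector.nodes


def features(node):
    """Feature vector: (depth, text length, is heading, class mentions 'author')."""
    return (node["path"].count("/"), len(node["text"]),
            int(node["tag"] in ("h1", "h2")), int("author" in node["cls"]))


def classify(vec):
    """Stand-in for a trained classifier: returns (label, score)."""
    _depth, length, is_heading, has_author = vec
    if is_heading:
        return "title", 0.9
    if has_author:
        return "author", 0.8
    if length > 40:
        return "body", 0.6
    return None, 0.0


def build_template(pages):
    """Gather node paths per label, then elect one path per schema element."""
    votes = defaultdict(lambda: defaultdict(float))
    for html in pages:
        for node in parse(html):
            label, score = classify(features(node))
            if label in SCHEMA:
                votes[label][node["path"]] += score
    return {label: max(paths, key=paths.get) for label, paths in votes.items()}


def apply_template(html, template):
    """Extract the text at each elected path (last match wins if a path repeats)."""
    by_path = {n["path"]: n["text"] for n in parse(html)}
    return {label: by_path.get(path, "") for label, path in template.items()}


def make_page(title, author, body):
    return (f'<html><body><div class="post"><h1>{title}</h1>'
            f'<span class="author">{author}</span>'
            f'<p class="content">{body}</p></div></body></html>')


PAGE_A = make_page("First post", "alice", "A long paragraph of post text, well over forty characters.")
PAGE_B = make_page("Second post", "bob", "Another long paragraph of post text, also over forty characters.")
PAGE_C = make_page("Third post", "carol", "short")

template = build_template([PAGE_A, PAGE_B])
print(template)
print(apply_template(PAGE_C, template))
```

Note that `PAGE_C` has a body too short for the stand-in classifier to label, yet the template still extracts it, because extraction follows the elected node path rather than re-classifying each page.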
Bell; Eric B. (Richland, WA), Bohn; Shawn J. (Richland, WA), Cowell; Andrew J. (Kennewick, WA), Gregory; Michelle L. (Richland, WA), Marshall; Eric J. (Corvallis, OR), Payne; Deborah A. (Richland, WA)
Battelle Memorial Institute (Richland, WA)
13/036,776
February 28, 2011
STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with Government support under Contract DE-AC0576RLO1830 awarded by the U.S. Department of Energy. The Government has certain rights in the invention.